Introduction

The aim of this project is to develop a machine learning model that predicts the loan status of customers based on information provided in their application profile. This is a binary classification problem in which we predict whether a loan will be approved or not. I will be using open-source data pulled from Kaggle (originally from an Analytics Vidhya hackathon) and implementing multiple techniques to yield the most accurate model for the problem.

Problem Statement

Loans are a necessity of the modern world, supporting consumption, economic growth, and business operations. Many types of loans exist for different purposes across various stages of life; among them are home loans, which we intend to tackle in this problem.

Dream Housing Finance company deals in all home loans. They have a presence across urban, semi-urban, and rural areas. Customers can apply for a home loan after the company validates their eligibility. The company wants to automate the loan eligibility process (in real time) based on customer details provided in the application, and to identify customer segments that are eligible for loan amounts so that it can specifically target these customers.

Loan prediction is a very common real-life problem that every retail bank faces at least once; automating this process could save time, resources, and money. There is, however, an unambiguous bias in lending. As such, we seek to examine the factors most predictive of loan status and fit multiple models to automate loan eligibility.

Dataset description

The data files provided consist of a training set (train.csv) and a test set (test.csv); the test set contains the same variables as the training set except for the loan status to be predicted. The training set consists of 614 observations on 13 variables (8 categorical and 5 numeric); the test set consists of 367 observations on 12 variables.

Since this project employs supervised learning methods, I intend to use only the training set. The dataset will be split into 70% training and 30% testing, using observed values to evaluate predictive accuracy.

Project outline

First, I will load the data, perform initial manipulation and cleaning, and address missing values. Next, I will perform exploratory data analysis, employing visualization and inferential techniques to identify trends, patterns, and relationships. After examining the data, I will perform some final tidying before setting up the models. I will split train.csv into a train and test set (70/30), build a recipe, and create validation sets to generate multiple estimates of the test error rate. I will then fit six supervised learning models (logistic regression, LDA, QDA, elastic net, KNN, and a pruned decision tree) and assess their performance using several evaluation metrics. From there, I will select the best model and fit it to the testing data.

Exploratory Data Analysis

We begin by loading and examining the data and performing some initial data tidying and manipulation. We then utilize visualization tools to explore the dataset and make some final adjustments before continuing with our analysis.

Loading the data

loan_ds <- read.csv("~/Desktop/School/PSTAT/PSTAT 131/proj-final/project_data/train.csv")
str(loan_ds)
## 'data.frame':    614 obs. of  13 variables:
##  $ Loan_ID          : chr  "LP001002" "LP001003" "LP001005" "LP001006" ...
##  $ Gender           : chr  "Male" "Male" "Male" "Male" ...
##  $ Married          : chr  "No" "Yes" "Yes" "Yes" ...
##  $ Dependents       : chr  "0" "1" "0" "0" ...
##  $ Education        : chr  "Graduate" "Graduate" "Graduate" "Not Graduate" ...
##  $ Self_Employed    : chr  "No" "No" "Yes" "No" ...
##  $ ApplicantIncome  : int  5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
##  $ CoapplicantIncome: num  0 1508 0 2358 0 ...
##  $ LoanAmount       : int  NA 128 66 120 141 267 95 158 168 349 ...
##  $ Loan_Amount_Term : int  360 360 360 360 360 360 360 360 360 360 ...
##  $ Credit_History   : int  1 1 1 1 1 1 1 0 1 1 ...
##  $ Property_Area    : chr  "Urban" "Rural" "Urban" "Urban" ...
##  $ Loan_Status      : chr  "Y" "N" "Y" "Y" ...

Observations:

  • Categorical variables (e.g., Gender, Married, Property_Area) are stored as characters and should be converted to factors.
  • Credit_History, a categorical variable, is encoded as numeric.
  • ApplicantIncome and CoapplicantIncome are monthly figures given in dollar amounts, while LoanAmount is given in terms of thousands. We ideally want these to be on the same scale.
colSums(is.na(loan_ds))  
##           Loan_ID            Gender           Married        Dependents 
##                 0                 0                 0                 0 
##         Education     Self_Employed   ApplicantIncome CoapplicantIncome 
##                 0                 0                 0                 0 
##        LoanAmount  Loan_Amount_Term    Credit_History     Property_Area 
##                22                14                50                 0 
##       Loan_Status 
##                 0

We observe missingness in the (numeric) variables LoanAmount, Loan_Amount_Term, and Credit_History. Note that is.na() does not detect blank entries in character variables; thus, we must employ a different method to identify empty strings.

sapply(loan_ds,function(x) table(as.character(x) =="")["TRUE"])
##           Loan_ID.NA          Gender.TRUE         Married.TRUE 
##                   NA                   13                    3 
##      Dependents.TRUE         Education.NA   Self_Employed.TRUE 
##                   15                   NA                   32 
##   ApplicantIncome.NA CoapplicantIncome.NA        LoanAmount.NA 
##                   NA                   NA                   NA 
##  Loan_Amount_Term.NA    Credit_History.NA     Property_Area.NA 
##                   NA                   NA                   NA 
##       Loan_Status.NA 
##                   NA

So we have blank entries in Gender, Married, Dependents, and Self_Employed. We will first convert these blanks into NA’s for easy identification and determine how to handle them at a later step.

Initial tidying

Reload the dataset, reading blank entries as NA.

loan_ds <- read.csv(file="~/Desktop/School/PSTAT/PSTAT 131/proj-final/project_data/train.csv",
                    header=TRUE,na.strings = c("",NA))   # read blanks as NA's 

Feature engineering: ApplicantIncome and CoapplicantIncome

  • Convert units into thousands of dollars to match that of LoanAmount.
loan_ds$ApplicantIncome <- (loan_ds$ApplicantIncome)/1000
loan_ds$CoapplicantIncome <- (loan_ds$CoapplicantIncome)/1000

Redundancy

  • Remove Loan_ID, a unique identifier, since it is not relevant to our analysis.
loan_ds <- loan_ds[,-1]; colnames(loan_ds)
##  [1] "Gender"            "Married"           "Dependents"       
##  [4] "Education"         "Self_Employed"     "ApplicantIncome"  
##  [7] "CoapplicantIncome" "LoanAmount"        "Loan_Amount_Term" 
## [10] "Credit_History"    "Property_Area"     "Loan_Status"

Data transformation

We convert categorical variables into factors.

# convert categorical variables into factor 
loan_ds$Gender <- factor(loan_ds$Gender, levels = c("Male","Female"))
loan_ds$Married <- factor(loan_ds$Married, levels = c("Yes","No"))
loan_ds$Education <- factor(loan_ds$Education, levels = c("Graduate","Not Graduate"))
loan_ds$Self_Employed <- factor(loan_ds$Self_Employed, levels = c("Yes","No"))
loan_ds$Property_Area <- factor(loan_ds$Property_Area, levels = c("Rural","Semiurban","Urban"))
loan_ds$Loan_Status <- factor(loan_ds$Loan_Status, levels = c("Y","N"), labels = c("Yes","No")) 
loan_ds$Credit_History <- factor(loan_ds$Credit_History, levels = c(1,0), labels = c("Yes","No"))  
loan_ds$Dependents <- recode(loan_ds$Dependents, "3+" = "3") %>%
  as.factor()   # collapse "3+" into "3"

Missing values

After transforming the data, we can now detect the correct amount of missingness in all variables.

colSums(is.na(loan_ds))
##            Gender           Married        Dependents         Education 
##                13                 3                15                 0 
##     Self_Employed   ApplicantIncome CoapplicantIncome        LoanAmount 
##                32                 0                 0                22 
##  Loan_Amount_Term    Credit_History     Property_Area       Loan_Status 
##                14                50                 0                 0
vis_miss(loan_ds) # visualize missing data

The table below provides a numerical summary of missingness in our dataset, showing the number of missing values in each variable, percent of missingness, and cumulative sum of missingness (a running total).

missingness <- loan_ds %>%
  miss_var_summary(add_cumsum = TRUE) %>%
  dplyr::arrange(n_miss_cumsum) 

missingness
sum(missingness$pct_miss)  # total % of missingness in dataset 
## <pillar_num[1]>
## [1] 24.3

There are 149 missing values in total (the per-variable missing percentages sum to ~24%). One approach is to omit all missing values or to remove variables with substantial missingness. Another is to impute missing values where appropriate. Neither option is appropriate at this early stage of our analysis, since we do not yet know which variables are significant in prediction.
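To make the ~24% figure concrete: miss_var_summary()’s pct_miss is computed per variable (n_miss divided by the 614 rows), so summing those percentages is equivalent to dividing the total count of missing cells by the row count. A small self-contained sketch on a toy data frame (not our loan data) illustrates the arithmetic:

```r
# toy data frame: one NA in each column; note that "" is NOT counted as NA
df <- data.frame(a = c(1, NA, 3, 4), b = c("x", "", NA, "z"))

sum(is.na(df))                             # total missing cells: 2
colSums(is.na(df))                         # per-variable counts: a = 1, b = 1
sum(colSums(is.na(df)) / nrow(df) * 100)   # sum of per-variable percentages: 50
```

Applying the same arithmetic to loan_ds gives 149 / 614 * 100, i.e. the ~24.3 reported above.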

We will leave the dataset as is for now and continue with exploratory data analysis before returning to address missing values at a later step.

Variable description

  • Loan_ID : Unique loan ID.
  • Gender : Male / Female.
  • Married : Whether the applicant is married (Yes/No).
  • Dependents : A factor indicating the number of dependents an applicant has (0, 1, 2, or 3, where "3" collapses "3+").
  • Education : An applicant’s education level (Graduate/Not Graduate).
  • Self_Employed : Whether the applicant is self-employed (Yes/No).
  • ApplicantIncome : An applicant’s monthly income (in thousands of dollars).
  • CoapplicantIncome : A coapplicant’s monthly income (in thousands of dollars).
  • LoanAmount : The loan amount requested by an applicant (in thousands of dollars).
  • Loan_Amount_Term : The term of the loan in months.
  • Credit_History : Whether the applicant’s credit history meets the bank’s requirements (Yes/No).
  • Property_Area : An applicant’s area of residence (Urban/Semiurban/Rural).
  • Loan_Status : Whether the loan was approved (Yes/No).

Visual EDA

This section consists of data exploration and visualization; we will first examine the response, generate a correlation matrix, and analyze the independent variables one by one to discern potential relationships.

Loan Status

First, we will look at the distribution of the response by creating a barplot.

loan_ds %>% 
  ggplot(aes(x = Loan_Status)) +
  geom_bar() + 
  theme_grey()  # create barplot 

loan_ds %>%
  select(Loan_Status) %>%
  table() %>%
  prop.table() 
## Loan_Status
##       Yes        No 
## 0.6872964 0.3127036

Approximately 69% of applicants were approved while 31% were rejected. This class imbalance may hinder our models’ ability to generate accurate predictions for Loan_Status. We will likely need to upsample or downsample the data at a later step.

Correlation plot

Examining dependency among independent variables is a crucial step in our analysis, providing insight into relationships, interactions, and potential issues such as multicollinearity.

The corrplot() function generates a graphical display of a correlation matrix: the main diagonal entries are all 1 (each variable is perfectly correlated with itself), and the off-diagonal cells are pairwise correlation coefficients. The scale on the right side of the plot indicates the strength and direction of the relationship for each pair. Note that corrplot() only accepts numeric variables.

The plot below illustrates the magnitude of correlation coefficients.

loan_ds %>%
  select(where(is.numeric)) %>%
  na.omit() %>%  
  cor() %>%
  corrplot(method="number")

We observe a moderate, positive correlation between LoanAmount and ApplicantIncome (0.57), and very little correlation (+/- 0.20) among the other numeric predictors (a good sign!). We will keep these findings in mind as we explore the dataset.

In the next few sections, we will analyze our predictors one-by-one to examine their distribution and relationship with each other and the response.

Loan Amount

We observe a right-skewed distribution, with most values falling between 0 and 400 thousand dollars, which gives a good sense of the typical amount requested. We also detect a few high outliers pulling the mean up. There is no significant difference between the average loan amounts requested by approved and rejected applicants; however, we see greater variation among the rejected applicants.

require(gridExtra)

plot1 <- loan_ds %>%
  na.omit(LoanAmount) %>%
  ggplot(aes(x=LoanAmount)) + 
  geom_histogram(bins=40) +
  theme_grey()

plot2 <- loan_ds %>%
  na.omit(LoanAmount) %>%
  ggplot(aes(Loan_Status, LoanAmount)) + 
  geom_boxplot(na.rm=T) +
  geom_jitter(alpha = 0.1) +
  theme_grey()

grid.arrange(plot1, plot2, ncol=2)

anova(aov(LoanAmount ~ Loan_Status, loan_ds))  # insignificant difference 

Recall that LoanAmount is positively correlated with ApplicantIncome. A plot of LoanAmount by ApplicantIncome shows an approximately linear trend, indicating that applicants with higher incomes tend to request larger loans. When stratifying applicants by Loan_Status, we observe similar approval rates across different loan amounts. However, it is difficult to discern whether applicant incomes are predictive of loan status. This brings us to our next section.
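The LoanAmount-by-ApplicantIncome scatter plot described above can be sketched as follows; the demo data frame is a hypothetical stand-in for loan_ds so the snippet runs on its own (in the report, loan_ds would be piped in instead):

```r
library(ggplot2)

# hypothetical stand-in rows (income in thousands/month, loan in thousands)
demo <- data.frame(
  ApplicantIncome = c(2.5, 4.0, 5.8, 8.1, 12.8),
  LoanAmount      = c(70, 110, 140, 190, 349),
  Loan_Status     = factor(c("Yes", "No", "Yes", "Yes", "No"))
)

p <- ggplot(demo, aes(x = ApplicantIncome, y = LoanAmount, color = Loan_Status)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +  # linear trend per status
  theme_grey()
p
```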

Applicant Income

Applicant monthly incomes range from 0.15 to 81 thousand dollars, with the middle half of values falling between roughly 2.9 and 5.8 thousand.

summary(loan_ds$ApplicantIncome)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.150   2.877   3.812   5.403   5.795  81.000

We observe a right-skewed distribution; even with extreme outliers omitted, there’s still an imbalance in the proportion of applicants falling within each income range. Average incomes among approved and rejected applicants are about the same.

plot1 <- loan_ds %>%
  filter(ApplicantIncome < 50) %>%    # omit high outliers 
  ggplot(aes(x=ApplicantIncome)) + 
  geom_histogram(fill="bisque",color="white",alpha=0.7, bins=10) + 
  geom_density() +
  geom_rug() + 
  labs(x = "applicant income") +
  theme_minimal()

plot2 <- loan_ds %>%
  filter(ApplicantIncome < 50) %>%  
  ggplot(aes(y=ApplicantIncome,x=Loan_Status, color=Loan_Status))+
  geom_boxplot() +
  theme_grey()

grid.arrange(plot1, plot2, ncol=2)

Upon inspecting the three high outliers (ApplicantIncome > 50), we find that:

  • 2 out of 3 applicants have 3 or more dependents, both of whom are self-employed;

  • 2 out of 3 applicants were approved for a loan; both reside in an urban area and have good credit history;

  • 2 out of 3 applicants are male;

  • All 3 applicants have a graduate degree and have no coapplicant.

loan_ds %>%
  filter(ApplicantIncome > 50)

These observations indicate that neither ApplicantIncome nor LoanAmount is very predictive of Loan_Status in the most extreme cases: the most affluent applicant, who requested the smallest amount, was rejected. In theory, we’d expect the opposite outcome.

We also observe that Property_Area and Credit_History are relevant factors even in applicants with very high incomes, seeing as the only applicant who was rejected resides in a rural area and has bad credit history.

It’s difficult to discern whether Education or Dependents affect loan status; we will examine these factors in the next sections.

For now, let’s take a closer look at approval rates. Applicants with monthly incomes of 15-20 thousand have the highest approval rate (~77%), so higher incomes do somewhat translate to higher approval rates. However, we must note that the bins contain varying numbers of data points, possibly inflating (or understating) approval rates.
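The binned approval rates can be computed along these lines; the bin breaks and the demo rows are illustrative assumptions (in the report, loan_ds would replace the toy data frame):

```r
library(dplyr)

# hypothetical stand-in for loan_ds (monthly income in thousands)
demo <- data.frame(
  ApplicantIncome = c(2, 4, 6, 9, 12, 16, 18, 22),
  Loan_Status     = c("Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "No")
)

rates <- demo %>%
  mutate(income_bin = cut(ApplicantIncome,
                          breaks = c(0, 5, 10, 15, 20, Inf))) %>%  # illustrative breaks
  group_by(income_bin) %>%
  summarise(n = n(),                                  # bin sizes vary, as noted above
            approval_rate = mean(Loan_Status == "Yes"))
rates
```

Reporting n alongside each rate makes it easy to spot bins where a rate rests on only a handful of applicants.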

Coapplicant Income

The plot below, grouped by Loan_Status, indicates that coapplicant incomes are right-skewed, with a mix of high and low values in each status category. Average coapplicant incomes for approved applicants are slightly higher than those of rejected applicants.

loan_ds %>%
  ggplot(aes(x=Loan_Status, y=CoapplicantIncome, color=Loan_Status)) + 
  geom_boxplot()

It is important to note that many coapplicant incomes are 0 (no coapplicant) in both categories, skewing the mean down significantly. Omitting those values provides better insight into the central tendency. The plot below suggests a different conclusion than what we assumed previously: among applicants who do have a coapplicant, the income distributions of the approved and rejected groups are roughly the same.

loan_ds %>%
  filter(CoapplicantIncome != 0) %>% 
  ggplot(aes(x=Loan_Status, y=CoapplicantIncome, color=Loan_Status)) + 
  geom_boxplot()

Next, we examine whether having a coapplicant itself affects loan status. The table below shows that 273 out of 614 applicants (about 44%) do not have a coapplicant, a relatively large share.

loan_ds %>%
  dplyr :: count(CoapplicantIncome == 0) 

The variable has_coapp takes the value FALSE if CoapplicantIncome is 0, and TRUE otherwise. A contingency table of Loan_Status and has_coapp indicates that applicants with coapplicants are more likely to be approved (~72% vs ~65%).

loan_ds %>%
  dplyr:: mutate(has_coapp = if_else(CoapplicantIncome != 0,TRUE,FALSE)) %>%
  group_by(has_coapp, Loan_Status) %>% 
  dplyr:: summarise(n=n()) %>%
  dplyr::mutate(freq = prop.table(n))  
# on average, ~72% of applicants w/ a coapplicant were approved for a loan
# while only ~65% of applicants w/o a coapplicant were approved.

Perhaps the presence of a coapplicant is more predictive of loan status than the numerical value of their income. We should consider transforming CoapplicantIncome into a factor.

Loan amount term

Loan_Amount_Term gives the term of the loan in months. From the plot below, we see that it is left-skewed, meaning its mean (342 months, i.e., 28.5 years) is lower than its median and mode. Because there are so few points on the lower end, the mode is more representative of its center.

loan_ds %>%
  na.omit(Loan_Amount_Term) %>%
  ggplot(aes(x=Loan_Amount_Term)) + 
  geom_bar() +
  theme_grey() # mfv 360 

We now look at Loan_Amount_Term in relation to Loan_Status. Applicants requesting short-term loans are more likely to be approved, on average, but we also see a high approval rate for 360-month terms. We should keep in mind that ~85% of loans have a term of 360 months; the data may under-represent applicants with alternative loan terms.

loan_ds %>%
  na.omit(Loan_Amount_Term) %>%  
  ggplot(aes(x=Loan_Amount_Term, fill=Loan_Status)) + 
  geom_bar(position="fill")

Dependents

Dependents is right-skewed; most applicants have no dependents. Approval rates are similar across categories, with the highest rate among applicants with 2 dependents. There is no clear pattern, which indicates that Dependents may not be very influential in determining Loan_Status.

plot1<-loan_ds %>%
  na.omit(Dependents) %>%  # should be able to impute this later
  ggplot(aes(x=Dependents)) +
  geom_bar()  # most applicants have no dependents

# dependents vs. loan status
plot2<-loan_ds %>%
  na.omit(Dependents) %>%  
  ggplot(aes(x=Dependents, fill=Loan_Status)) + 
  geom_bar(position="fill")
# relatively similar likelihood of approval for each # of dependents

grid.arrange(plot1,plot2,ncol=2)

A natural question is whether applicants with dependents (indicative of a larger household) request larger loans. The boxplots below indeed suggest that individuals with dependents do, on average, request a larger loan amount; however, the actual number of dependents does not seem very significant.
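The boxplots of LoanAmount by Dependents described above can be sketched like so (the demo rows are a hypothetical stand-in for loan_ds so the snippet is self-contained):

```r
library(ggplot2)

# hypothetical stand-in: loan amount (thousands) by number of dependents
demo <- data.frame(
  Dependents = factor(c("0", "0", "0", "1", "1", "2", "2", "3")),
  LoanAmount = c(85, 100, 110, 125, 140, 150, 165, 160)
)

p <- ggplot(demo, aes(x = Dependents, y = LoanAmount)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.3, width = 0.1) +  # overlay the raw points
  theme_grey()
p
```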

Gender

We observe more male applicants than female applicants (81% vs 19%); thus, females may be underrepresented. A natural question is whether there is bias in the selection process. From the plots below, we can see that females are indeed less likely to be approved for a loan than males (by about 8 percentage points).

prop.table(table(loan_ds$Gender))  # more males than females 
## 
##      Male    Female 
## 0.8136439 0.1863561
loan_ds %>%
  na.omit(Gender) %>%  # should be able to impute this later
  ggplot(aes(x=Gender, fill=Loan_Status)) + 
  geom_bar(position="fill")

# 2-way contingency table 
loan_ds %>%
  na.omit(Gender) %>%
  group_by(Gender, Loan_Status) %>% 
  dplyr:: summarise(n=n()) %>%
  dplyr::mutate(freq = prop.table(n))  
## `summarise()` has grouped output by 'Gender'. You can override using the
## `.groups` argument.

Married

Married applicants comprise a majority of our dataset (~65%), which is surprising since most applicants have no dependents. Nonetheless, a roughly 65-35 split offers good contrast. Those who are married are about 10 percentage points more likely to be approved for a loan, a pretty significant difference given the size of our dataset.

prop.table(table(loan_ds$Married))
## 
##        No       Yes 
## 0.3486088 0.6513912
loan_ds %>%
  na.omit(Married) %>%
  ggplot(aes(x=Married,fill=Loan_Status)) +
  geom_bar(position="fill")

# 2-way contingency table 
loan_ds %>%
  na.omit(Married) %>%
  group_by(Married, Loan_Status) %>% 
  dplyr:: summarise(n=n()) %>%
  dplyr::mutate(freq = prop.table(n))  

Education

Education is a factor representing an applicant’s education level, with levels “Graduate” and “Not Graduate”. Our dataset comprises ~78% graduates; an 80-20 split is definitely unbalanced. From the barplots, we can see that graduates are more likely to be approved (by about 8 percentage points).

prop.table(table(loan_ds$Education))
## 
##     Graduate Not Graduate 
##     0.781759     0.218241
loan_ds %>%
  na.omit(Education) %>%
  ggplot(aes(x=Education,fill=Loan_Status)) +
  geom_bar(position="fill")

# 2-way contingency table 
loan_ds %>%
  na.omit(Education) %>%
  group_by(Education, Loan_Status) %>% 
  dplyr:: summarise(n=n()) %>%
  dplyr::mutate(freq = prop.table(n))  

Self employed

Self_Employed is a factor indicating whether an applicant is self-employed. Only ~14% of applicants in the dataset are self-employed. There is no significant difference in approval rates between self-employed and non-self-employed individuals; only a slight difference of ~4 percentage points, with non-self-employed individuals having the higher rate.

prop.table(table(loan_ds$Self_Employed))
## 
##       Yes        No 
## 0.1408935 0.8591065
loan_ds %>%
  na.omit(Self_Employed) %>%
  ggplot(aes(x=Self_Employed,fill=Loan_Status)) +
  geom_bar(position="fill")

# 2-way contingency table 
loan_ds %>%
  na.omit(Self_Employed) %>%
  group_by(Self_Employed, Loan_Status) %>% 
  dplyr:: summarise(n=n()) %>%
  dplyr::mutate(freq = prop.table(n))  

Credit History

Credit_History is a factor indicating whether an applicant’s credit satisfies the bank’s requirements. Most applicants do have good credit history (~84%). Credit history appears to be a very important factor, given that nearly 80% of applicants with good credit history are approved, whereas only about 10% of applicants with bad credit history are. It is important to note, however, that there may be some selection bias, since individuals with good credit history may be more inclined to apply in the first place.

prop.table(table(loan_ds$Credit_History))
## 
##       Yes        No 
## 0.8421986 0.1578014
loan_ds %>%
  na.omit(Credit_History) %>%
  ggplot(aes(x=Credit_History,fill=Loan_Status)) +
  geom_bar(position="fill")

# 2-way contingency table 
loan_ds %>%
  na.omit(Credit_History) %>%
  group_by(Credit_History, Loan_Status) %>% 
  dplyr:: summarise(n=n()) %>%
  dplyr::mutate(freq = prop.table(n))  

Property Area

Property_Area is a factor representing the area in which an applicant resides: Urban, Semiurban, or Rural. We have a good mix of applicants from all three areas; those residing in semiurban areas have the highest approval rate. We also observe that married individuals tend to prefer semiurban areas over others. This is a pretty insightful finding because, as we recall, married individuals had an approval rate about 10 percentage points higher than unmarried individuals. Coupled with the higher approval rate for those with coapplicants, we may just have found our target demographic!

prop.table(table(loan_ds$Property_Area))
## 
##     Rural Semiurban     Urban 
## 0.2915309 0.3794788 0.3289902
loan_ds %>%
  na.omit(Property_Area) %>%
  ggplot(aes(x=Property_Area,fill=Loan_Status)) +
  geom_bar(position="fill")

# 2-way contingency table 
loan_ds %>%
  na.omit(Property_Area) %>%
  group_by(Property_Area, Loan_Status) %>% 
  dplyr:: summarise(n=n()) %>%
  dplyr::mutate(freq = prop.table(n))  
# is property area related to any other predictors?
loan_ds %>%
  na.omit(Property_Area, Loan_Status, Married) %>%
  dplyr:: mutate(has_coapp = if_else(CoapplicantIncome != 0,TRUE,FALSE)) %>%
  ggplot(aes(x=Property_Area,fill=Married)) +
  geom_bar(position="dodge") +
  facet_wrap(~has_coapp)

Final tidying

Now that we’ve explored the dataset, we will need to fix some errors before continuing with our analysis. Let’s review these issues:

  • Missingness observed in 7 variables;

  • Extreme high outliers observed in ApplicantIncome and LoanAmount.

MICE

Missingness exists in both numeric and categorical variables. We will therefore use the mice package, which imputes missing values with plausible values inferred from the other variables in the dataset.

# install and load 
# install.packages("mice")
library(mice)

From the missing-data table below, we see that the first two variables are missing a large proportion of their values, while the last five are missing none.

loan_ds %>%
  miss_var_summary()

Now we call mice(). The argument m indicates the number of multiple imputations; the standard is m = 5. The method argument specifies the imputation method applied to all variables in the dataset; a separate method can also be specified for each variable.

We can control the defaultMethod used for different types of data. I will choose predictive mean matching for numeric data, logistic regression for 2-level factors, linear discriminant analysis for unordered factor data, and proportional odds for ordered factor data.

imp <- mice(loan_ds, m=5, defaultMethod = c("pmm","logreg", "lda", "polr"))

Here, we can see the actual imputations for Dependents:

imp$imp$Dependents

Now let’s merge the imputed data into our original dataset via the complete() function.

loan_ds <- complete(imp,5)  # I chose the 5th round of data imputation

Checking the missing data again, we see that none remains after imputation:

loan_ds %>%
  miss_var_summary()

Outliers

Outliers can be tricky: it is hard to determine whether they are data-entry errors, sampling errors, or natural variation in our data. Removing those records, however, may result in information loss. We will assume that the outliers reflect natural variation until proven otherwise.

Looking at LoanAmount, we see that the “extreme” values are somewhat plausible; some customers may genuinely want to apply for a loan as high as 700 thousand.

zscore <- (abs(loan_ds$LoanAmount-mean(loan_ds$LoanAmount, na.rm=T))/sd(loan_ds$LoanAmount, na.rm=T))
loan_ds$LoanAmount[which(zscore > 3)]
##  [1] 650 600 700 495 436 480 480 490 570 405 500 480 480 600 496

Since we have a positive skew, we will perform a log transformation to normalize the data. After the transformation, the distribution looks closer to normal and the effect of the extreme outliers is significantly smaller.

loan_ds$LogLoanAmount <- log(loan_ds$LoanAmount)
plot1 <- loan_ds %>% 
  ggplot(aes(x=LoanAmount)) +
  geom_histogram(aes(y = after_stat(density)), bins=20) +  # density scale so the overlay is visible
  geom_density() +
  labs(title="Histogram for Loan Amount") +
  xlab("Loan Amount") 

plot2 <- loan_ds %>% 
  ggplot(aes(x=LogLoanAmount)) +
  geom_histogram(aes(y = after_stat(density)), bins=20) +
  geom_density() +
  labs(title="Histogram for Log Loan Amount") +
  xlab("Log Loan Amount") 

grid.arrange(plot1,plot2,ncol=2)

We also have a pretty severe positive skew for ApplicantIncome, so we will perform a log transformation as well. The data looks much better.

loan_ds$LogApplicantIncome <- log(loan_ds$ApplicantIncome)
plot1 <- loan_ds %>% 
  ggplot(aes(x=ApplicantIncome)) +
  geom_histogram(aes(y = after_stat(density)), bins=20) +  # density scale so the overlay is visible
  geom_density() +
  labs(title="Histogram for Applicant Income") +
  xlab("Applicant Income") 

plot2 <- loan_ds %>% 
  ggplot(aes(x=LogApplicantIncome)) +
  geom_histogram(aes(y = after_stat(density)), bins=20) +
  geom_density() +
  labs(title="Histogram for Log Applicant Income") +
  xlab("Log Applicant Income") 

grid.arrange(plot1,plot2,ncol=2)

Now we remove the original variables from our dataset:

loan_ds <- select(loan_ds, -LoanAmount, -ApplicantIncome)  # remove the original (untransformed) variables

Setting Up Models

Now that we have a better idea of how the variables in our dataset impact loan status, it’s time to set up our models. We will perform our train/test split, create our recipe, then establish 10-fold cross-validation to help with our models.

Train/test split

Before we do any modeling, we need to randomly split our dataset into a train and a test set. We split the data to avoid overfitting: we fit the models on the training data, then use those models to make predictions on the previously unseen testing data. The test set is reserved and fit only once, after the models have “learned” from the train set. From there, we will use error metrics to evaluate each model’s performance. We will use a 70/30 split, since our dataset is relatively small and we want to reserve enough data for the test set. We set a random seed before the split so that we can replicate our results, and stratify on the response.

set.seed(3450)
loan_split <- initial_split(loan_ds, prop = 0.70, 
                              strata = "Loan_Status")

loan_train <- training(loan_split)
loan_test <- testing(loan_split)
loan_folds <- vfold_cv(loan_train, v = 10, strata = "Loan_Status")

Dimensions of our datasets:

dim(loan_train); dim(loan_test)
## [1] 429  12
## [1] 185  12

Building the recipe

Now that we’ve completed all the preliminary steps, it’s time to build our recipe. Think of it as following a recipe for cut-out cookies. Because we’ll be using a variety of different molds (models), each cookie will look different, but the ingredients will be the same; inside, they’re all the same flour, sugar, and eggs! That’s what this recipe is: a unique mix of ingredients that will be fitted to different molds. Our goal becomes finding the best mold for our particular mix. From there, fitting the best model to our test data is analogous to using a different brand of the essential ingredients (i.e., the test data), shaping the dough with our best cookie mold, and putting it in the oven!

In our recipe, we’ll be using 8 out of the 11 original predictors, 2 transformed variables LogLoanAmount and LogApplicantIncome, plus a new variable Coapplicant.

We’ll first need to upsample the data. Recall from earlier that our response is imbalanced; if we train our models on an imbalanced dataset, they can become better at identifying one level than the other, which is undesirable. Two solutions come to mind: upsampling and downsampling. Since we have a small dataset, step_upsample() is the better option. We’ll use over_ratio = 1 so that there are exactly as many No’s as Yes’s. Because upsampling is intended to be performed on the training set alone, the default is skip = TRUE. We’ll first set skip = FALSE to verify that the step brings the class counts to equality, then rewrite the recipe with the default.

Since the values of CoapplicantIncome do not appear to affect our response, we’ll transform it into a categorical variable Coapplicant that indicates the presence or absence of a coapplicant. We’ll then scale and center our numeric predictors, and dummy-code the nominal predictors.

loan_recipe <- recipe(Loan_Status ~ ., data = loan_train) %>%
  step_upsample(Loan_Status, over_ratio = 1, skip = FALSE) %>% 
    # transform coapplicant income into a factor:
    # Yes if CoapplicantIncome is not 0, No otherwise
  step_mutate(Coapplicant = factor(if_else(CoapplicantIncome != 0, "Yes", "No", NA_character_))) %>%
  step_rm(CoapplicantIncome) %>% 
  step_scale(all_numeric_predictors()) %>%
  step_center(all_numeric_predictors()) %>%  # scale and center
  step_dummy(all_nominal_predictors())       # dummy-code nominal predictors

prep(loan_recipe) %>% bake(new_data = loan_train) %>% 
  group_by(Loan_Status) %>% 
  dplyr::summarise(count = n())

Now we rewrite the recipe with skip=TRUE:

loan_recipe <- recipe(Loan_Status ~ ., data = loan_train) %>%
  step_upsample(Loan_Status, over_ratio = 1, skip = TRUE) %>% 
    # transform coapplicant income into a factor:
    # Yes if CoapplicantIncome is not 0, No otherwise
  step_mutate(Coapplicant = factor(if_else(CoapplicantIncome != 0, "Yes", "No", NA_character_))) %>%
  step_rm(CoapplicantIncome) %>% 
  step_scale(all_numeric_predictors()) %>%
  step_center(all_numeric_predictors()) %>%  # scale and center
  step_dummy(all_nominal_predictors())       # dummy-code nominal predictors

We can use prep() to check the recipe to verify it worked.

prep(loan_recipe) %>% 
  bake(new_data = loan_train) %>% 
  kable() %>% 
  kable_styling(full_width = F) %>% 
  scroll_box(width = "100%", height = "200px")
Loan_Amount_Term LogLoanAmount LogApplicantIncome Loan_Status Gender_Female Married_Yes Dependents_X1 Dependents_X2 Dependents_X3 Education_Not.Graduate Self_Employed_No Credit_History_No Property_Area_Semiurban Property_Area_Urban Coapplicant_Yes
0.2412411 -0.0491444 0.1280076 No 0 1 1 0 0 0 1 0 0 0 1
0.2412411 0.3564005 -0.4959078 No 0 1 0 0 1 0 1 1 1 0 1
0.2412411 -0.2722349 -1.2439388 No 0 0 0 0 0 0 1 0 0 0 1
(… 429 rows in total; output truncated for readability)

Notice that, by dummy-coding the nominal predictors, we’ve increased the number of columns in our dataset. This is because each factor has been transformed into k-1 dummy variables, with one level held out as the reference (or baseline) level. The baseline level does not get its own column; it is implicitly encoded as all zeros. For a given predictor, if the dummy variables corresponding to every other level are 0, then we default to the baseline. For instance, if both Property_Area_Urban and Property_Area_Semiurban are 0, then the applicant must be from a Rural area.
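The k-1 coding can be seen directly in base R with model.matrix(), using a toy factor (not our actual data); the reference level Rural gets no column of its own:

```r
area <- factor(c("Rural", "Semiurban", "Urban"))
mm   <- model.matrix(~ area)

colnames(mm)  # "(Intercept)" "areaSemiurban" "areaUrban" -- no Rural column
mm[1, ]       # the Rural row: both dummy columns are 0
```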

K-fold cross-validation

We will stratify on our response variable Loan_Status and use 10 folds to perform stratified cross validation. K-fold cross-validation divides our data into k folds of roughly equal sizes, holds out the first fold as a validation set, and fits the model on the remaining k-1 folds as if they were the training set. This is repeated k times; each time, a different fold is used as a validation set. This results in k estimates of the test MSE (or in the classification case, test error rate).

loan_folds <- vfold_cv(loan_train, v = 10, strata = Loan_Status)
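Under the hood, the fold assignment amounts to partitioning the row indices into k roughly equal groups; a base-R sketch (ignoring the stratification that vfold_cv() adds) looks like this:

```r
set.seed(3450)
n <- 429   # rows in our training set
k <- 10
fold_id <- sample(rep(1:k, length.out = n))  # random fold label per row

# Each fold serves once as the validation set; e.g., fold 1:
val_idx   <- which(fold_id == 1)   # held-out fold
train_idx <- which(fold_id != 1)   # remaining k-1 folds

table(fold_id)  # roughly 43 rows per fold
```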

To save computational time, we will save the results to an RDA file; once we have the setup we want, we can load it back later without having to rerun anything.

save(loan_ds, loan_folds, loan_recipe, loan_train, loan_test, 
     file = "~/Desktop/School/PSTAT/PSTAT 131/proj-final/rda_files/loan-setup.rda")

Model Building

It’s time to build our models! For efficiency and ease of access, I will build each model in a separate R file and save the results in RDA files. The models will then be loaded below for further exploration. This allows us to streamline our analysis and save on computational time.

For each model, we will:

  1. Set up the model by specifying its type, engine, and mode.
  2. Set up the workflow; add the model and defined recipe.

For models requiring parameter tuning, we’ll complete steps 3-7.

  3. Use grid_regular() to set up tuning grids of values for the parameters we’re tuning and specify levels for each.
  4. Fit the models to our folded data via tune_grid().
  5. Select the best value(s) of the parameter(s) based on roc_auc and finalize the workflow.
  6. Fit the final model to the entire training set.
  7. Save the results to an RDA file.

Afterwards, we’ll load back in the saved files, collect error metrics, and analyze their individual performances.

Error metric

The performance metric we’ll be using is roc_auc, the area under the ROC curve. The ROC (receiver operating characteristic) curve is a popular graphic that plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. TPR is sensitivity: the proportion of actual positives that are correctly classified. FPR is 1-specificity: the proportion of actual negatives that are incorrectly classified as positive. The higher the TPR at a given FPR, the better. The AUC (area under the curve) summarizes the diagnostic ability of a classifier across all thresholds, highlighting the trade-off between sensitivity and specificity.
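A hand computation of a single point on the ROC curve (using made-up labels and predicted probabilities, not our loan data) makes the definitions concrete; sweeping the threshold from 0 to 1 traces out the full curve:

```r
truth <- c(1, 1, 1, 0, 0, 1, 0, 0)                  # 1 = loan approved
prob  <- c(0.9, 0.8, 0.4, 0.7, 0.2, 0.6, 0.3, 0.1)  # predicted P(approved)
pred  <- as.integer(prob >= 0.5)                    # one threshold choice

tpr <- sum(pred == 1 & truth == 1) / sum(truth == 1)  # sensitivity
fpr <- sum(pred == 1 & truth == 0) / sum(truth == 0)  # 1 - specificity
c(TPR = tpr, FPR = fpr)  # 0.75, 0.25
```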

Model Evaluation

It’s time to load our models back in to evaluate their results!

load(file= "~/Desktop/School/PSTAT/PSTAT 131/proj-final/rda_files/logistic.rda")
load(file= "~/Desktop/School/PSTAT/PSTAT 131/proj-final/rda_files/knn.rda")
load(file= "~/Desktop/School/PSTAT/PSTAT 131/proj-final/rda_files/en.rda")
load(file= "~/Desktop/School/PSTAT/PSTAT 131/proj-final/rda_files/lda.rda")
load(file= "~/Desktop/School/PSTAT/PSTAT 131/proj-final/rda_files/qda.rda")
load(file= "~/Desktop/School/PSTAT/PSTAT 131/proj-final/rda_files/decision-tree.rda")

Model autoplots

Here, we will visualize the results of our tuned models. We will use the autoplot function to visualize the effect of varying select parameters on the performance of each model according to its impact on our metric of choice.

K-nearest neighbors

For the KNN model, we had 10 different levels of neighbors. In general, the higher the number of neighbors, the greater the roc_auc. The roc_auc score of the best performing model (k=10) is approximately 0.71, which is pretty decent.

autoplot(knn_tune_res)

Elastic net

In our elastic net model, we tuned 2 parameters with 10 levels each: penalty, the amount of regularization, and mixture, the proportion of lasso penalty (1 for pure lasso, 0 for pure ridge). We can see from the graph that the optimal mixture was 0 (pure ridge): lower levels of mixture resulted in higher roc_auc scores, and models performed worse as penalty (the amount of regularization) increased.

autoplot(en_tune_res)

Decision tree

For our decision tree model, we focused on the parameter cost_complexity and tuned it with 10 levels. Oftentimes decision trees can have too many splits, leading to a very complex model that is likely to overfit the data. A smaller tree with fewer splits can address this issue by yielding a simpler model (better interpretation, more bias).

The idea of cost-complexity pruning is similar to that of lasso / ridge regularization: first, we grow a very large tree, then consider a sequence of pruned sub-trees and select the one that minimizes a penalized error metric. The tuning parameter cost_complexity controls a trade-off between a subtree’s complexity and its fit to the training data; when cost_complexity is 0, the penalized metric is just the training error rate, and as cost_complexity increases, the tree is penalized for having too many nodes.
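As a minimal sketch of the same idea with rpart (a stand-in illustration on a built-in dataset, not the tidymodels tuning we actually ran): grow a deliberately overgrown tree, then prune it back with a chosen cp:

```r
library(rpart)

# Grow a deliberately overgrown tree (no complexity penalty)
fit_big <- rpart(Species ~ ., data = iris,
                 control = rpart.control(cp = 0, minsplit = 2))

# A larger cp imposes a heavier penalty per split, yielding a smaller tree
fit_pruned <- prune(fit_big, cp = 0.05)

# Pruning can only shrink the tree, never grow it
nrow(fit_pruned$frame) <= nrow(fit_big$frame)
```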

We can see from the plot below that a cost-complexity of about 0.25 yields the optimal model (with the highest roc_auc). This indicates that pruning was the correct choice. Note that the parameter uses the log10_trans() function by default, so all of the values in our grid are on the log10 scale.

autoplot(dt_tune_res)

Model selection

Here, we will compare the performance of each model on the training data and create a visualization. I’ve created a tibble to display the training roc_auc scores for each fitted model.

log_auc <- augment(log_fit, new_data = loan_train) %>%
  roc_auc(truth = Loan_Status, .pred_Yes) %>%
  select(.estimate)

lda_auc <- augment(lda_fit, new_data = loan_train) %>%
  roc_auc(truth = Loan_Status, .pred_Yes) %>%
  select(.estimate)

qda_auc <- augment(qda_fit, new_data = loan_train) %>%
  roc_auc(truth = Loan_Status, .pred_Yes) %>%
  select(.estimate)

knn_auc <- augment(knn_final_fit, new_data = loan_train) %>%
  roc_auc(truth = Loan_Status, .pred_Yes) %>%
  select(.estimate)

en_auc <- augment(en_final_fit, new_data = loan_train) %>%
  roc_auc(truth = Loan_Status, .pred_Yes) %>%
  select(.estimate)

dt_auc <- augment(dt_final_fit, new_data = loan_train) %>%
  roc_auc(truth = Loan_Status, .pred_Yes) %>%
  select(.estimate)



roc_aucs <- c(log_auc$.estimate,
              lda_auc$.estimate,
              qda_auc$.estimate,
              knn_auc$.estimate,
              en_auc$.estimate,
              dt_auc$.estimate)

mod_names <- c("Logistic Regression",
               "LDA",
               "QDA",
               "KNN",
               "Elastic Net",
               "Decision Tree")

mod_results <- tibble(Model = mod_names,
                      ROC_AUC = roc_aucs)

mod_results <- mod_results %>% 
  dplyr::arrange(desc(ROC_AUC))

mod_results
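The repeated augment()/roc_auc() chunks above could also be collapsed into a single loop (a sketch that reuses the fit objects defined earlier in the project):

```r
library(purrr)

fits <- list("Logistic Regression" = log_fit,
             "LDA"                 = lda_fit,
             "QDA"                 = qda_fit,
             "KNN"                 = knn_final_fit,
             "Elastic Net"         = en_final_fit,
             "Decision Tree"       = dt_final_fit)

# One roc_auc per model, collected into a single sorted tibble
mod_results <- imap_dfr(fits, function(fit, name) {
  auc <- augment(fit, new_data = loan_train) %>%
    roc_auc(truth = Loan_Status, .pred_Yes)
  tibble(Model = name, ROC_AUC = auc$.estimate)
}) %>%
  dplyr::arrange(desc(ROC_AUC))
```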

While all of our models performed well, the best-performing model is the KNN model with an roc_auc score of 0.94, followed by the QDA model at 0.86. I’ve created a lollipop plot below to help visualize these results.

lp_plot <- ggplot(mod_results, aes(x = Model, y = ROC_AUC)) + 
  geom_segment(aes(x = Model, xend = Model, y = 0, yend = ROC_AUC)) +
  geom_point(size = 7, color = "black", fill = alpha("blue", 0.3), alpha = 0.7, shape = 21, stroke = 3) +
  labs(title = "Model Results") + 
  theme_minimal()

lp_plot

Results of the Best Models

Now that we’ve identified our best models, we can continue to further analyze their true performance. We will start with the KNN model, then analyze the performance of the QDA and elastic net models as a means of comparison.

KNN model

Performance on the folds

So, the KNN model performed the best overall, but which value of neighbors yields the best performance?

# select the roc_auc metrics of the best knn model
knn_tune_res %>% 
  collect_metrics() %>% 
  dplyr::filter(.metric == "roc_auc") %>% 
  dplyr::arrange(desc(mean)) %>% 
  slice(1)

KNN model #10, with 11 predictors, 10 neighbors, and a mean roc_auc score of 0.69, performed the best! Now that we have our best model, we can fit it to the testing data to explore its true predictive power.

Testing the model

Despite performing well on the training set, the KNN model performed poorly on our test data. In general, an AUC value between 0.7 and 0.8 is considered acceptable; the KNN model falls about 0.2 points short of the lower boundary.

knn10_roc_auc <- augment(knn_final_fit, new_data = loan_test) %>%
  roc_auc(Loan_Status, .pred_Yes)  %>%
  select(.estimate)

knn10_roc_auc 

ROC curve

Below is a confusion matrix of the test results as well as an ROC/AUC plot.

knn_test_results <- augment(knn_final_fit, new_data = loan_test)

knn_test_results %>% 
  conf_mat(truth = Loan_Status, estimate = .pred_class) %>% 
  autoplot(type = "heatmap")

knn_test_results %>% 
  roc_curve(Loan_Status, .pred_Yes) %>%
  autoplot()

In general, the closer an ROC curve hugs the top-left corner of the plot, the better the AUC. While our curve is not perfect, it has the correct general shape.
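The AUC itself has a useful ranking interpretation: it is the probability that a randomly chosen "Yes" case receives a higher predicted probability than a randomly chosen "No" case. A small base-R illustration with made-up scores:

```r
# AUC as the probability that a random positive outranks a random negative
# (ties count as 0.5)
auc_by_rank <- function(scores, labels) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3)  # hypothetical .pred_Yes values
labels <- c(1,   1,   0,   1,   0,   0)    # 1 = Yes, 0 = No

auc_by_rank(scores, labels)  # 8/9, about 0.889
```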

Here’s a distribution of the predicted probabilities.

knn_test_results %>% 
  ggplot(aes(x = .pred_Yes, fill = Loan_Status)) + 
  geom_histogram(position = "dodge") + theme_bw() +
  xlab("Probability of Yes") +
  scale_fill_manual(values = c("blue", "orange"))

QDA

Now, it’s time to analyze our quadratic discriminant analysis (QDA) classifier. In short, it’s a more flexible version of the LDA model that finds non-linear (quadratic) decision boundaries between classes, assuming each class follows a Gaussian distribution with its own covariance matrix.
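Outside of tidymodels, the same classifier is available directly (a sketch on the built-in iris data, purely for illustration, not our loan data):

```r
library(MASS)

# QDA fits one Gaussian per class with its own covariance matrix,
# which yields quadratic decision boundaries
fit  <- qda(Species ~ Sepal.Length + Sepal.Width, data = iris)
pred <- predict(fit, iris)
mean(pred$class == iris$Species)  # in-sample accuracy
```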

Testing the model

To my surprise, the QDA model performed better than the KNN model, though its computed roc_auc score is only slightly higher. Nevertheless, a 0.02-point increase is meaningful when it comes to AUC.

qda_roc_auc <- augment(qda_fit, new_data = loan_test, type = 'prob') %>%
  roc_auc(Loan_Status, .pred_Yes) %>%
  select(.estimate)

qda_roc_auc

ROC curve

Instead of fluctuating between concavity and convexity (as the KNN curve does), the QDA model’s ROC curve is consistently concave, which is a definite improvement.

augment(qda_fit, new_data = loan_test, type = 'prob') %>%
  roc_curve(Loan_Status, .pred_Yes) %>%
  autoplot()

Elastic Net

Lastly, we explore the results of the elastic net model on our test data. First, let’s compute its roc_auc score and then create visualizations as needed.

Testing the model

The elastic net model performed the best out of our top 3 models, with an roc_auc score of 0.75.

en_roc_auc <- augment(en_final_fit, new_data = loan_test, type = 'prob') %>%
  roc_auc(Loan_Status, .pred_Yes) %>%
  select(.estimate)

en_roc_auc

ROC curve

Its ROC curve looks much better than that of the KNN model, and is an improvement from the QDA model as well. From approximately 0.5 specificity onward, sensitivity sits near 1.0. This is a good sign!

augment(en_final_fit, new_data = loan_test, type = 'prob') %>%
  roc_curve(Loan_Status, .pred_Yes) %>%
  autoplot()

Conclusion

In this project, we tackled the problem of loan prediction given select demographics specified in applicant profiles. We worked with a relatively small dataset with a large number of features. We tidied the data, performed exploratory analysis, and fit a number of models of varying complexity and flexibility. Through analysis, testing, and assessment, we found the elastic net model to be best suited for predicting the loan status of an applicant. However, the model was not perfect and leaves room for improvement.

In fact, none of our models performed particularly well for this problem. This could be due to a variety of factors, such as violated model assumptions or overfitting. None of the models considered were particularly robust against overfitting (with the exception of the elastic net).

Both the logistic regression and LDA models assume a linear decision boundary and are prone to error in higher dimensions. The elastic net (ridge) model was efficient for variance reduction but risked increased bias; in our problem, it seems to have reduced variance by a greater margin, thus decreasing overall error. The QDA model, while an improvement over LDA and logistic regression, is not as flexible as KNN or a decision tree. For more complex decision boundaries, a non-parametric approach may be preferred. The decision tree has high variance and tends to overfit. KNN doesn’t require linear separability and makes no distributional assumptions; however, it does not model relationships explicitly and is also prone to overfitting.

Given that the QDA and elastic net (Ridge) models performed relatively well on the test set, we can infer that the relationships in our data are non-linear. A potential improvement would be to consider alternative non-linear models or non-linear extensions to some of our models. Another option would be to consider non-parametric approaches.

In terms of our error metric (roc_auc), the elastic net model outperformed the QDA model on the test set while under-performing on the training set. It is important to note that neither model had particularly high predictive accuracy (both around 0.7-0.8), likely because neither is well suited to dimension reduction. A more flexible approach, such as a random forest, may be better suited to our data; its ability to down-weight redundant features and noise could lead to an improvement in model accuracy.
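A random forest could be slotted into the same tidymodels workflow (a sketch; loan_recipe is assumed to be the recipe used for the other models, and the tuning choices here are illustrative):

```r
library(tidymodels)

rf_spec <- rand_forest(mtry = tune(), trees = 500, min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

rf_wf <- workflow() %>%
  add_model(rf_spec) %>%
  add_recipe(loan_recipe)  # loan_recipe: assumed from earlier chunks
```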

It’s also worth acknowledging that none of our models performed particularly poorly either. Loan prediction is no easy feat, and predictive models are undoubtedly prone to nuisance factors and noise. In addition, our dataset was incomplete; inclusion of factors such as an applicant’s age or race would provide a clearer picture of the company’s target demographic and possibly reveal implicit biases in lending. With this understanding, assigning a class label to each applicant based on a select few demographics seems unfair. Instead, applicants should be assessed on a case-by-case basis.